Eureka? Identifying and Locating Objects on Ancient Greek Vases

Final Project for CS209B Advanced Topics in Data Science

Project Authors: Abdulla Saif, Mark Conmy, Ibrahim Ouf, Lucas Kitzmüller

May 10, 2021

Eureka (Ancient Greek: εὕρηκα) is an interjection used to celebrate a discovery or invention.



Outline

1. Introduction

Return to contents

1.1 Research Question

Return to contents

There are massive quantities of images of Greek vases available to researchers. However, the value of these databases remains limited if researchers have no practical way of searching and sorting these images. For example, if a classicist who is interested in the representations of Dionysos or Herakles does not have an efficient method of compiling vases with images depicting these objects, they would miss out on a research opportunity. Currently, labeling these images is a time-intensive manual process, and as a result, many images do not have labels or labels that provide insufficient details needed for more complex analyses.

Developing machine learning models that detect whether and where objects of interest (e.g., Dionysos or Herakles) are depicted on a vase is a complex problem. Greek vase painting is a uniform field with consistent representations of the objects. However, the vases, and more importantly, the available images of the vases vary considerably in terms of perspective, level of zoom, brightness, image dimensions, etc. In addition, computer vision to date has mostly focused on real-world images such as ImageNet. Therefore, learnings may not transfer perfectly to Greek vase paintings in which objects are shown in two dimensions using incisions (black figure vases) or lines (red figure vases).

The original problem description is available here.

1.2 Summary of Findings

Return to contents

We first scraped the images and metadata from all collections in the Arms and Armor database. This dataset includes 165,219 images from 66,649 vases. Based on an exploratory analysis of the existing labels, we decided to focus on four objects: Dionysos, Herakles, Athena, and Hermes. Given the low consistency and quality of images in the database, we then manually selected a subset of consistent images for training and testing. This process involved three steps: (i) manually removing low-quality images (e.g., images that just show sketches), (ii) manually creating four labels indicating whether the image depicts the objects of interest, and (iii) applying a “smart-cropper” based on the Open Computer Vision package to standardize the paintings and obtain as consistent a depiction as possible. The final dataset we used for training includes 2,424 images in total, of which 1,881 depict at least one of the four objects.

We then used the image labels (i.e., weakly supervised learning) to create a model that detects whether and where any of the objects are depicted on a vase. Because of our limited number of training images, we rely on transfer learning, in particular using the Inception network as our backbone feature detector. Our architecture is based on the concept of utilizing global average pooling (GAP) for object detection, as described by Zhou et al. (2016). The models perform well in detecting whether objects are located on the image: the F1 scores of the predictions of our preferred model on the test data range from 0.38 (Hermes) to 0.80 (Dionysos). The models perform less well in localizing objects on the images. For tagging purposes, it is perhaps unnecessary to properly localize the objects, but doing so assists humans in confirming the model’s answers. The use of heatmaps could prove sufficient to assist in automatically labelling archives, but masks and bounding boxes rely on subjective cut-offs. The model’s greatest strength is its sparse training requirements: it only needs to know whether an object is somewhere on the image, not where. This suits the data available within the archives.

We also experimented with unsupervised learning. In particular, we created an introspective variational autoencoder (IntroVAE) model as described by Huang et al. (2019). The motivation is that if the autoencoder reconstructs images at high quality, we can apply a clustering algorithm to the latent space to detect emergent categories. If the emergent classes/clusters align with meaningful interpretations (e.g., does the image show Dionysos?), then we could also generate vectors from that region in the latent space to create more datapoints for training. Unfortunately, since the reconstructions are of poor quality, this approach did not work. Tuning hyperparameters and expanding the dataset could still produce better results, but both training the model and manually labelling images proved too time-intensive.


1.3 Supporting Notebooks and Manually Labeled Data

Return to contents

This notebook provides a summary of the completed work. Additional notebooks with the code for web scraping, additional EDA, and modelling are available here on OneDrive. Apart from the manual labeling of images, all results presented in this notebook can be reproduced by running the notebooks in sequential order. The raw images scraped from the Arms and Armor database, as well as the manually labelled and cropped image data, are saved in the same folder. Throughout the report we provide links to the relevant supporting notebooks and data for reference.

1.4 Data Source and Scraping

Return to contents

We scraped the images and metadata from all collections in the Arms and Armor database: the Beazley Archive, the British Museum, and the Harvard Art Museums. We used the code shared by the teaching staff of the course but made some adjustments. For example, we downloaded the images in batches to make the process more robust to connectivity disruptions. The scraping code can be found here.
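The batched-download idea can be sketched as follows. This is an illustrative stand-in, not the project's actual scraping code: the function name, parameters, and retry policy are all assumptions.

```python
import time
import urllib.request
from pathlib import Path

def download_in_batches(urls, out_dir, batch_size=100, retries=3):
    """Download images in fixed-size batches with per-file retries, so a
    connectivity disruption only costs the current download, not the whole run.
    (Illustrative sketch; names and parameters are hypothetical.)"""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for start in range(0, len(urls), batch_size):
        for url in urls[start:start + batch_size]:
            for attempt in range(retries):
                try:
                    data = urllib.request.urlopen(url, timeout=30).read()
                    (out / Path(url).name).write_bytes(data)
                    break  # success: move to the next image
                except OSError:
                    time.sleep(2 ** attempt)  # exponential back-off, then retry
```

Saving each batch to disk as it completes also means a restarted run can skip batches that already exist.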

The scraped dataset includes 165,219 images from 66,649 vases. The number of vases is lower than the reported 115,655 items on the Arms and Armor website, as many links were broken. The scraped data, organized in batches, is also available on OneDrive.

1.5 Import libraries

Return to contents

2. Exploratory Data Analysis

Return to contents

Motivation for EDA

We had two primary objectives for the exploratory data analysis of the image metadata. First, we wanted to understand how the vases in the dataset vary across different characteristics. This analysis informed some of our decisions in creating the high-quality dataset described in Section 3 (e.g., focusing on Amphora-type vases). Second, based on the Decoration label, we wanted to identify canonical and ubiquitous objects in the paintings that we could train a model to detect.

Loading data

2.1 Distribution of Vase Characteristics

Return to contents

Findings from Distribution Plots

This basic exploratory analysis of each vase’s metadata showed that six out of ten vases are Athenian, by far the biggest fabric category. 40% of paintings use the red-figure technique and 30% the black-figure technique. Vases are very heterogeneous in terms of shape, with the most common shape being Lekythos (10%). Half of the vases date from 525 to 425 BC. Most vases are labeled as gray. Vases’ origins are highly diverse, at least based on the uncleaned provenance string data. An important finding is also that ‘decoration’, the primary label of the vase paintings, is missing for two-thirds of all images in our dataset.

2.2 Word Clouds of Labels

Return to contents

To identify canonical and well-represented objects in the paintings that we could train a model to detect, we tokenized the ‘decoration’ label and computed the frequency of each word (see unigram word cloud).
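The tokenization step can be sketched as follows; the three decoration strings below are made-up stand-ins for the scraped metadata:

```python
import re
from collections import Counter

# Hypothetical decoration labels; in the project these come from the
# scraped metadata's 'decoration' field.
decorations = [
    "Dionysos seated with kantharos",
    "Herakles wrestling the Nemean lion",
    "Athena with shield and lance",
]

# Tokenize each label into lowercase words and count frequencies.
tokens = []
for label in decorations:
    tokens.extend(re.findall(r"[a-z]+", label.lower()))

word_counts = Counter(tokens)
print(word_counts.most_common(3))
```

The resulting counts feed directly into the word-cloud visualization and into the selection of frequent, object-like terms.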

Findings from Word Clouds

Some of the most frequent words describe scenes rather than objects (e.g., “seated”) or are too generic to yield a meaningful classification (e.g., “body”).

Based on these considerations, we decided to focus on the following four objects in our project:

  1. Dionysos, the Olympian god of wine
  2. Herakles, a Greek hero and the son of Zeus and Alcmene
  3. Athena, the goddess of wisdom and war
  4. Hermes, the messenger of the gods

3. Creating a High-Quality Dataset

Return to contents

3.1 Manual Labeling and Basic Image Quality Screening

Return to contents

Motivation

Unfortunately, decoration, the primary label of the vase paintings, is missing for two-thirds of all images. In addition, during EDA, we noticed that the labels are inaccurate: for example, we found many images showing Dionysos with his typical attributes, but the decoration label did not include him. We also found some labels that do include Dionysos but the image does not. This contamination can prevent any model from effectively detecting Dionysos and his key features.

Further, the vase images vary considerably. For example, some images show only sketches, are zoomed in on a particular element, are taken from a high angle, or are too dark to recognize any objects. This lack of uniformity can hinder effective training, as models might end up focusing on backgrounds or other elements within the images rather than the actual painting.

Process

We therefore decided to manually screen images for quality and create labels indicating whether they depict the objects of interest. To ensure the final sample includes sufficiently many images depicting objects of interest, we first tokenized the decoration label and selected the images that, at least according to the original label, included one or more of the objects of interest. We then went through these pictures manually and (a) checked that the painting on the vase is clearly visible in the image and (b) added a label for each of the four objects of interest (Athena, Dionysos, Herakles, and Hermes). To construct a test set, we followed the same process for some of the images that either had no decoration label or whose decoration label did not mention any of the four objects.

The final labeled training dataset includes 2,424 images in total: 1,881 depict at least one object of interest, and 543 do not. Within the 1,881 labeled images, 590 show Athena, 758 Dionysos, 686 Herakles, and 231 Hermes; some images show multiple objects of interest. The test set has 182 pictures: 27 contain Athena, 55 Dionysos, 32 Herakles, 17 Hermes, and 74 contain none of the objects.

The final dataset is available here.

We explore this data briefly below.

Overview of Vase Types

The gallery below shows all available types. The purpose of the smart cropper developed further below is to extract the paintings from a large variety of vase types. This is clearly a challenging task.

All four objects of interest (Athena, Dionysos, Herakles, Hermes) generally occur more often in isolation than with other objects of interest. We note, however, that Athena in particular tends to co-occur frequently with Herakles. This is unsurprising, as she helped Herakles during his 12 labors (source). Athena and Hermes also often appear in the same painting, as do Herakles and Hermes.

3.2 Smart-Cropping

Return to contents

Overview

Despite the basic quality screening in step 3.1, one of the dataset’s greatest limitations remains the mixture of zoomed-out images depicting the entire vase (i.e., both the painting and the vase body) and those depicting just the scenes/paintings. Furthermore, the vases differ in their dimensions, so the paintings also differ in their dimensions. Thus, we implemented a “smart-cropper” to standardize the paintings and obtain as consistent a depiction as possible.

We used the Open Computer Vision package to create our smart cropper. By thresholding the image by color, we are able to obtain a mask of the vase if the image is zoomed out, allowing us to separate foreground from background. Then, we utilize edge detection to draw a bounding box around the vase (if not zoomed in). From there, taking advantage of the fact that vases of the same type have similar dimensions with their paintings located in comparable positions, we crop out vase-specific portions of the bounding box (manually calibrated for each vase type). To account for the fact that some images depict only the painting while others are already sufficiently zoomed in, we filter them out using the fact that such images almost always have greater width than height. We also utilize the ratio of the areas of the bounding box and the whole image to filter out images that are already sufficiently zoomed in. The end result is a far cleaner set of consistent (though not completely perfect) images depicting the paintings on the vases.
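The core cropping logic can be sketched as follows. For brevity this stand-in uses NumPy thresholding and bounding-box extraction rather than OpenCV's contour functions, and the threshold, area ratio, and 0.25/0.75 band fractions are illustrative placeholders for the per-type calibrated parameters:

```python
import numpy as np

def crop_vase(image, bg_thresh=40, zoom_ratio=0.85):
    """Sketch of the smart-cropper logic (NumPy stand-in for the
    OpenCV version used in the project).

    image: 2D grayscale array with a dark background and a brighter vase.
    """
    # Threshold to separate the vase (foreground) from the background.
    mask = image > bg_thresh
    ys, xs = np.where(mask)
    if len(ys) == 0:
        return image  # nothing detected; return unchanged

    # Bounding box around the foreground.
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1
    h, w = bottom - top, right - left

    # If the bounding box already fills most of the frame, the image is
    # likely zoomed in on the painting, so skip cropping.
    if (h * w) / image.size > zoom_ratio:
        return image

    # Otherwise crop a vase-type-specific band of the bounding box; the
    # 0.25/0.75 fractions are placeholders for the calibrated parameters.
    band_top = top + int(0.25 * h)
    band_bottom = top + int(0.75 * h)
    return image[band_top:band_bottom, left:right]

# Example: a 100x100 dark image with a bright 60x40 "vase" in the middle.
img = np.zeros((100, 100), dtype=np.uint8)
img[20:80, 30:70] = 200
cropped = crop_vase(img)
print(cropped.shape)  # (30, 40)
```

The real pipeline additionally checks the width-to-height ratio to skip images that already show only the painting.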

Below are the manually calibrated parameters used to extract the region of interest from the bounding box for each vase type. The cropping function is shown below. We allow for some exceptions based on a combination of vase type and extracted bounding box, because some vase types differ in dimensions more than others.

Demonstration of Smart-Cropping Process

Apply Smart-Cropper to Training Set

Apply Smart-Cropper to Null and Test Set

4. Weakly Supervised Object Detection and Classification

Return to contents

4.1 Overview

Return to contents
Using nothing but image-level labels (i.e. weakly supervised learning), we create a model that successfully detects whether an object of interest (i.e. Dionysos, Herakles, Athena, Hermes or any combination of them) is depicted on a vase and where. Because of our limited number of training images, we rely on transfer learning—in particular using the Inception network as our backbone feature detector. Our architecture is based on the concept of utilizing global average pooling (GAP) for object detection, as described by Zhou et al (2016): Learning Deep Features for Discriminative Localization.

The key idea is to create class activation maps (CAMs) from the feature maps generated by GAP. The CAMs are computed as a weighted average of those feature maps. The process is shown below, or here in case this notebook is not viewed on OneDrive (the image is from the paper):

(Figure: the class activation mapping process, from Zhou et al. 2016)

We then detect the objects by thresholding the CAMs. We utilize networks pre-trained on ImageNet as the backbone convolutional model, which is then fed into a GAP layer. We show that this approach successfully overcomes our limited and messy dataset. We note that we are tackling a multi-label rather than a multi-class problem: the objects we seek to detect frequently occur together, presenting an interesting multi-label challenge.
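The CAM computation itself reduces to a weighted sum over feature maps, CAM(x, y) = Σₖ wₖ · fₖ(x, y). A minimal NumPy sketch with toy feature maps and weights:

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Compute a CAM as the weighted sum of the backbone's feature maps.

    feature_maps: array of shape (H, W, K) from the last conv layer.
    class_weights: array of shape (K,), the dense-layer weights
                   connecting the GAP output to one class logit.
    """
    # CAM(x, y) = sum_k w_k * f_k(x, y)
    cam = np.tensordot(feature_maps, class_weights, axes=([2], [0]))
    # Normalize to [0, 1] so the map can be thresholded or overlaid.
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam

# Toy example: 4x4 spatial grid, 3 feature maps.
fmaps = np.zeros((4, 4, 3))
fmaps[1:3, 1:3, 0] = 1.0              # feature 0 fires in the center
weights = np.array([2.0, 0.0, 0.0])   # the class depends only on feature 0
cam = class_activation_map(fmaps, weights)
print(cam[2, 2], cam[0, 0])  # center is hot (1.0), corner is cold (0.0)
```

Thresholding this normalized map (e.g., keeping values above some cut-off) yields the detected region, which is where the subjective cut-off choice discussed later comes in.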

We relied on the following sources:

1) https://github.com/ray0809/weakly-supervised-object-localization

2) http://emaraic.com/blog/weakly-supervised-detection

3) https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf

4) https://github.com/tahaemara/weakly-supervised-detection

4.2 Training and Testing Data

Return to contents

4.3 The Models

Return to contents

We create four models: three variations using the Inception network and one using ResNet. By cutting off the topmost layers of these networks, we can append our GAP layer followed by a dense layer for multi-label classification. One challenge is balancing the pre-trained networks’ ability to detect features learned from the ImageNet dataset against adjusting their weights for our own task. By freezing layers, we retain the ability to detect features but risk not adjusting the weights enough for our task. Hence, we must freeze an appropriate number of upper layers to properly balance low-level feature detection common across all images against macro features representative of our dataset. Thus, we experiment with three models utilizing Inception as the backbone but differing in the number of frozen layers: 20 topmost layers frozen, 50 topmost layers frozen, and completely unfrozen. We also experimented with other networks to serve as the backbone and, as an example, display the performance of a model using ResNet as the backbone (which did not perform as well).

We also attempted to create our own convolutional model (a simplified VGG network) to serve as the backbone feature detector. The model failed to perform well, so we omit it for brevity. This displays the importance of transfer learning to overcome our limited dataset.

Because we are creating a multi-label model, our activation function is sigmoid (rather than softmax used in multi-class classification). Our loss function is binary cross-entropy (rather than categorical cross-entropy used in multi-class classification).
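A minimal Keras sketch of this architecture. Note `weights=None` is used here only so the snippet runs offline; the project would load the pre-trained ImageNet weights. The frozen-layer count reflects one of the variants described above:

```python
import tensorflow as tf

NUM_CLASSES = 4  # Athena, Dionysos, Herakles, Hermes

# Backbone: InceptionV3 without its classification head. weights=None so
# this sketch runs offline; use weights="imagenet" for transfer learning.
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights=None, input_shape=(299, 299, 3)
)

# Freeze the 20 topmost backbone layers (one of the project's variants).
for layer in backbone.layers[-20:]:
    layer.trainable = False

# GAP + dense head. Sigmoid (not softmax) because an image may show
# several of the four figures at once (multi-label).
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid"),
])

# Binary cross-entropy treats each label as an independent yes/no decision.
model.compile(optimizer="adam", loss="binary_crossentropy")
print(model.output_shape)  # (None, 4)
```

The weights of the final dense layer are exactly the per-class weights later used to compute the CAMs from the backbone's feature maps.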

4.4 Augmentation and Training

Return to contents

As a reminder, our training data consists of 2,424 images in total: 1,881 depict at least one object of interest, and 543 depict none. This is a small dataset, so we augment it.

We apply six varieties of augmentation (counting horizontal and vertical shifts separately) and explain our reasoning for each: 1) Rotation: The dataset’s images are generally taken with the vases standing in an upright position, but some of the images are slightly tilted. Hence, we augment our data with minor rotations (up to 8 degrees), but not so much as to produce images of vases lying down (i.e., a 45-degree rotation), for example.

2) Horizontal flipping: Objects may be depicted on either side of the vase, and they may face in either direction (though it appears Dionysos is more likely to face rightwards, for example). All vase types are symmetric, so horizontal flipping is appropriate.

3) Zooming: An important difference between images is in how zoomed in they are/closely depict the scene or object—representing one of the largest issues in the lack of data standardization even post smart-cropping. Thus, we apply a generally wide zoom range (up to 20%), but not so much that the object is no longer visible.

4) Horizontal & vertical shifts: The training images are not perfectly centered, even post-cropping. Thus, and also to prevent overfitting, we apply moderate shifting amounting to 10% of the image range.

5) Shearing: The training images are taken from a wide variety of perspectives, e.g., from the sides versus straight on. Shearing allows us to take this into account.
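These settings map directly onto Keras's `ImageDataGenerator`; the exact shear range below is an assumed placeholder, the other values follow the rationale above:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation settings mirroring the six varieties described above.
datagen = ImageDataGenerator(
    rotation_range=8,        # slight tilts only, never lying-down vases
    horizontal_flip=True,    # figures can face either direction
    zoom_range=0.2,          # varying zoom levels; object stays visible
    width_shift_range=0.1,   # images are not perfectly centered
    height_shift_range=0.1,
    shear_range=10,          # varying camera perspectives (assumed value)
)

# Generate one augmented batch from a dummy image.
x = np.random.rand(1, 224, 224, 3)
batch = next(datagen.flow(x, batch_size=1))
print(batch.shape)  # (1, 224, 224, 3)
```

Because the augmentation runs on the fly during training, each epoch sees a slightly different version of every image, which helps offset the small dataset.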

4.5 Examining Performance

Return to contents

4.6 Visual Inspection: Single Object

Return to contents

We investigate our model’s ability to localize objects. Note that object detection performance is usually measured through metrics such as the Jaccard index (i.e., the overlap between the predicted object’s location and its actual location, divided by their union) or DICE (i.e., equivalent to F1: two times the overlap between prediction and actual, divided by the total number of pixels). However, we lack the pixel-level labels needed to compute such metrics (indeed, that is why we opted for weakly supervised object detection), so we rely on visual inspection instead.
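For reference, both metrics are straightforward to compute when pixel-level masks are available:

```python
import numpy as np

def jaccard(pred, true):
    """Jaccard index: intersection over union of two binary masks."""
    inter = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return inter / union if union else 1.0

def dice(pred, true):
    """DICE coefficient (equivalent to F1): twice the intersection
    divided by the total number of positive pixels in both masks."""
    inter = np.logical_and(pred, true).sum()
    total = pred.sum() + true.sum()
    return 2 * inter / total if total else 1.0

# Toy masks: the prediction covers half of the true region and nothing else.
true_mask = np.zeros((4, 4), dtype=bool)
true_mask[:, :2] = True   # 8 true pixels
pred_mask = np.zeros((4, 4), dtype=bool)
pred_mask[:, :1] = True   # 4 predicted pixels, all inside the true region
print(jaccard(pred_mask, true_mask))  # 0.5
print(dice(pred_mask, true_mask))     # ~0.667
```

Collecting even a small set of pixel-level annotations would allow these metrics to quantify the localization quality we assess visually below.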

Despite our lack of pixel-level labels, we believe our approach has merit in understanding not just the ability to detect the object itself, but the context that accompanies these objects. As we shall see, Dionysos is frequently accompanied by vines, and the model relies on these vines in determining whether Dionysos is depicted and where. Similarly, Herakles is often seen wrestling animals, so the presence of animals assists in detecting his presence.

The model correctly determines whether Dionysos is somewhere in the image, and correctly notes no other object is within the image. It generally performs well in localizing where Dionysos is, but it does not capture his full body. There is a degree of arbitrariness in segmenting the image based on thresholding the heatmaps, one of the model’s weaknesses. Regardless, we view this prediction as a general success, especially given the faded nature of the painting and the good localization precision and recall.

The model correctly determines whether Herakles is somewhere in the image, and correctly notes no other object is within the image. It focuses on the interaction of Herakles and the lion (depictions of them wrestling are quite common). Localization is not particularly good in terms of recall (i.e., finding Herakles’ whole body), but precision (i.e., not including irrelevant areas) can be regarded as quite good if we regard the lion as part of Herakles’ depiction. Because Herakles is frequently depicted wrestling animals, the prediction raises the question of whether a human would have determined Herakles was being depicted without the lion he is wrestling. Is it incorrect to consider the lion part of the Herakles depiction?

We see one of the model’s weaknesses is that segmentation is not guaranteed to be contiguous, due to the arbitrary nature of determining cut-offs for the CAMs. Localization in this example is mixed, showing a portion of Athena’s body with some irrelevant parts. It failed to detect Herakles, despite his appearance on the vase.

The model correctly detected Athena and Herakles but failed to see Hermes. Object localization is good in terms of recall, but not precision. This is possibly a consequence of these objects co-occurring, so the model understands both objects are there, but not who is who.

We examine Dionysos-class CAMs for Dionysos images to understand what features the model most closely examines. We see that the presence of two vines surrounding a robed figure’s body is a major giveaway. The model tends to focus on the lower portions of the images, with the figures’ feet and the vines. For some images, the model incorrectly focuses on another person or just the vines. We see that Dionysos is depicted consistently, typically standing up with a kantharos but occasionally sitting down or riding a chariot. The model appears to perform worse on the rarer scenes.

Athena is customarily portrayed wearing body armor and a helmet and carrying a shield and a lance. The Athena class CAMs for Athena images suggest the model generally focused on these elements for detection – e.g., in several pictures the shield and lance are clearly highlighted. The model had more difficulty with Athena’s uncommon depictions, for example riding a chariot (middle image, upper row).

We examine Herakles-class CAMs for Herakles images to understand what features the model most closely examines. We noted previously that the model struggles more with Herakles test images, and the worse performance compared to Dionysos is also visible in object detection. In 5 images (2, 4, 8, 9 and 10), it localizes Herakles well, with his figure falling in the red zone. In other images, the model appears to be looking at objects frequently depicted with him, such as soldiers, horses or beasts. We also see that Herakles is depicted in a wider range of scenes than Dionysos (which could be studied through text analysis of available descriptions); this contributes to Herakles being relatively harder to detect.

As noted earlier, there are relatively few training images depicting Hermes, and our models did not perform well in detecting this figure. The Hermes CAMs also reveal a lack of consistency in which elements the model focuses on when making a prediction. Interestingly, the model appears to focus somewhat on shields even though Hermes is typically not depicted with a shield. A likely explanation is that Hermes is often shown with Athena, who often carries a shield. Therefore, the model wrongly picks up on shields as an identifying feature of Hermes.

4.7 Visual Inspection: Multiple Objects

Return to contents

In this section, we examine the model's ability to detect multiple objects simultaneously.

The sample above depicts a surprising success. Though the model does not localize the objects well, as displayed by the bounding boxes, it nevertheless places those boxes correctly on the objects. The model displays a good understanding of the differences between these two objects.

We see that the CAMs display coherent distinctions. The model correctly notes that the left part of the image corresponds more closely with Dionysos and the right part with Athena. It is partly confusing Herakles with Athena, who frequently appears with him in the same images. Regardless, this provides more evidence of proper object detection.

The CAMs above show the model’s value in detecting where objects are, even if exact localization is not achieved. The heatmaps for Dionysos and Herakles strongly overlap with where the objects actually are (in the middle and on the left). Hermes (far right) remains elusive, however, and the model does not know where to look. It understands that if a woman and Herakles are depicted together, that woman is likely Athena.

4.8 Strengths and Weaknesses

Return to contents

1) For most objects, the model performs well in detecting whether they are in the image but less successful in localizing them. For tagging purposes, perhaps it is unnecessary to properly localize the objects, but doing so assists humans in confirming the model’s answers. The use of heatmaps could prove sufficient to assist in automatically labelling archives, as masks and bounding boxes rely on subjective cut-offs.

2) The model’s greatest strength is its sparse training requirements: it only needs to know whether an object is somewhere on the image, not where. This suits pre-existing archives’ limitations.

3) Whether the model relies on particular objects or general contexts is unclear. It appears to rely on a mix of the two—just as humans do. For example, the presence of vines is a strong indication of Dionysos’s presence. On the other hand, the combination of men and animals is more indicative of Herakles, rather than the man himself.

4) The model generally succeeds in showing the approximate location of objects, but is not precise in identifying the exact area.

4.9 Future Work

Return to contents

There are several potential avenues to further improve this work:

1) We could implement more sophisticated weakly supervised models (e.g. self-attention) as done in Huang et al. (2020).

2) We could combine unsupervised object segmentation with heatmaps to better locate objects (e.g., find the hottest area, locate in which segment it falls, and use that segment as the detected object). By first segmenting the objects using unsupervised methods, then determining the “hottest” areas in the CAMs, and finally mapping the hottest areas to the segmented image, we could force the instance segmentation to be contiguous.

3) We could incorporate a wider array of objects, especially non-humans.

4) We could use transfer learning from more similar datasets, such as water paintings or comic books (Gonthier et al. 2018).

5) We could also further improve the quality of the training data by improving cropping and potentially adding synthetic images of objects that frequently co-occur with other objects (e.g. Hermes).

5. Introspective Variational AutoEncoder (IntroVAE)

Return to contents

5.1 Motivation

Return to contents

In this section, we create an introspective variational autoencoder (IntroVAE) model as described by Huang et al. (2019). The motivation for this approach is that if the autoencoder reconstructs images at high quality, we can apply a clustering algorithm to the latent space to detect emergent categories. If the emergent classes/clusters align with meaningful interpretations (e.g., does the image show Dionysos?), then we could also generate vectors from that region in the latent space to create more datapoints for training.

IntroVAEs have the same qualities as a VAE but train the encoder and generator parts of the model iteratively on different loss functions. Both the encoder loss and the generator loss include the MSE between the original image and the image reconstructed by passing it through the encoder and generator. The encoder loss additionally includes three KL divergence terms: the standard KL divergence of the posterior, the KL divergence obtained after freezing the generator’s weights and encoding its output a second time, and the KL divergence obtained after decoding a randomly generated latent vector (again with the generator frozen) and encoding the result. The generator loss additionally includes the standard KL divergence and the KL divergence obtained after passing a randomly generated latent vector through the generator and encoder.

Our work draws in particular on section "3.3 Training IntroVAE networks" on pp. 5-6 and "C Illustration of training flow" on p. 15 in Huang et al. 2019.

5.2 Custom Losses

Return to contents

Alpha and beta are weights for KL divergence losses and the MSE between input and output respectively. Code for custom loss and custom IVAE class adapted from here.
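A simplified NumPy sketch of this weighting scheme, with the KL divergence written in its closed form for a diagonal Gaussian posterior. The alpha and beta defaults are placeholders, not the values used in training:

```python
import numpy as np

def kl_divergence(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def reconstruction_loss(x, x_hat):
    """MSE between the original and the reconstructed image."""
    return np.mean((x - x_hat) ** 2)

def weighted_loss(x, x_hat, mu, log_var, alpha=1.0, beta=0.5):
    """alpha weights the KL term and beta the reconstruction MSE,
    mirroring the weighting described above (placeholder values)."""
    return alpha * kl_divergence(mu, log_var) + beta * reconstruction_loss(x, x_hat)

# Sanity check: a standard-normal posterior and a perfect reconstruction
# give zero loss.
mu = np.zeros(8)
log_var = np.zeros(8)
x = np.ones((4, 4))
x_hat = x.copy()
print(weighted_loss(x, x_hat, mu, log_var))  # 0.0
```

The full IntroVAE objective applies this KL term at several points in the training flow (real, reconstructed, and sampled images), as described in the previous section.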

5.3 Custom Classes

Return to contents

The "Sampling" class allows the model to learn a mean and variance for the distribution underlying the embedding layer. While training, it samples an epsilon subject to $\epsilon \sim N(0, 1)$ and constructs the embedding as $\text{embedding} = \text{mean} + (\epsilon \cdot \text{variance})$.
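A minimal sketch of the reparameterization trick that a Sampling class of this kind implements. Note this sketch uses the standard formulation, multiplying epsilon by the standard deviation $\exp(\frac{1}{2}\log\sigma^2)$ derived from a learned log-variance:

```python
import numpy as np

def sample_embedding(mean, log_var, rng=np.random.default_rng(0)):
    """Reparameterization trick: draw epsilon ~ N(0, 1) and build the
    embedding as mean + epsilon * std. Sampling epsilon separately keeps
    the embedding differentiable with respect to mean and log_var."""
    epsilon = rng.standard_normal(mean.shape)
    return mean + epsilon * np.exp(0.5 * log_var)

# Sanity check: as log_var -> -inf (std -> 0), the sample collapses
# to the mean.
mean = np.array([1.0, -2.0])
z = sample_embedding(mean, log_var=np.full(2, -100.0))
print(np.allclose(z, mean))  # True
```

In the Keras model, this function would be the body of the Sampling layer's `call`, taking the encoder's mean and log-variance outputs as inputs.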

5.4 Encoder

Return to contents

5.5 Generator

Return to contents

5.6 Loading data

Return to contents

5.7 Optimizers and Fitting Model

Return to contents

5.8 Results

Return to contents

Unfortunately, the reconstructed images are blurry, and it is not possible to detect objects on them.

Overall, the loss for both the encoder and the generator is highly unstable but trends downward.

5.9 Comparison with Baseline Model

Return to contents

Unfortunately, the reconstruction of the images with the IVAE is worse than with the baseline VAE model – the pictures are blurrier, and the RMSE is higher.

5.10 Summary and Future Work

Return to contents

Our idea was that if we can reconstruct images at high quality via an autoencoder, we can apply a clustering algorithm to the latent space to detect emergent categories. Unfortunately, since the reconstructed images are of poor quality, this approach did not work.

In the analysis conducted for Milestone 2 we found that there wasn't meaningful clustering of points in the latent space created with the baseline VAE model. Since the reconstruction from the IVAE is worse, both visually and by RMSE on the validation set, we expect that the IVAE will not demonstrate emergent clustering in a meaningful way either.

Moving forward, it would be interesting to see whether the IVAE could eventually outperform the VAE if trained for enough epochs. Both were trained for 100 epochs, but the VAE might hit a performance limit at some point where it can no longer minimize the loss. Also, time did not permit extensive experimentation with the hyperparameters of the encoder and generator losses for the IVAE, which made a significant difference in outcomes in the IntroVAE paper.